Fact Book
Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts
Gupta, Raavi, Panicker, Pranav Hari, Bhatia, Sumit, Ramakrishnan, Ganesh
Large language models (LLMs), despite their remarkable text generation capabilities, often hallucinate and generate text that is factually incorrect and not grounded in real-world knowledge. This poses serious risks in domains like healthcare, finance, and customer support. A typical way to use LLMs is via the APIs provided by LLM vendors where there is no access to model weights or options to fine-tune the model. Existing methods to detect hallucinations in such settings where the model access is restricted or constrained by resources typically require making multiple LLM API calls, increasing latency and API cost. We introduce CONFACTCHECK, an efficient hallucination detection approach that does not leverage any external knowledge base and works on the simple intuition that responses to factual probes within the generated text should be consistent within a single LLM and across different LLMs. Rigorous empirical evaluation on multiple datasets that cover both the generation of factual texts and the open generation shows that CONFACTCHECK can detect hallucinated facts efficiently using fewer resources and achieves higher accuracy scores compared to existing baselines that operate under similar conditions. Our code is available here.
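The consistency intuition in this abstract can be illustrated with a minimal sketch. This is not the CONFACTCHECK implementation: the function names, the exact-match comparison after normalization, and the 0.5 threshold are all assumptions made for illustration. It assumes the answers to a factual probe have already been collected from one or more LLMs.

```python
import string


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())


def consistency_score(original: str, probe_answers: list[str]) -> float:
    """Fraction of probe answers that agree with the fact in the generated text."""
    if not probe_answers:
        return 0.0
    target = normalize(original)
    matches = sum(1 for a in probe_answers if normalize(a) == target)
    return matches / len(probe_answers)


def is_suspect(original: str, probe_answers: list[str], threshold: float = 0.5) -> bool:
    """Flag a fact as possibly hallucinated when agreement is below the threshold.

    The threshold is a hypothetical default, not a value from the paper.
    """
    return consistency_score(original, probe_answers) < threshold
```

For example, `is_suspect("1912", ["1915", "1915", "1912"])` flags the fact because only one of three probe answers agrees with it.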
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > Austria > Vienna (0.14)
- South America > Argentina (0.05)
- Research Report (1.00)
- Overview > Fact Book (0.43)
A critique of pure stupidity: understanding Trump 2.0
President Donald Trump holds charts as he speaks about the economy in the Oval Office, August 2025. If the first term of Donald Trump provoked anxiety over the fate of objective knowledge, the second has led to claims we live in a world-historical age of stupid, accelerated by big tech. But might there be a way out? The first and second Trump administrations have provoked markedly different critical reactions. The shock of 2016 and its aftermath saw a wave of liberal anxiety about the fate of objective knowledge, not only in the US but also in Britain, where the Brexit referendum that year had been won by a campaign that misrepresented key facts and figures.
- Europe > United Kingdom (0.69)
- Oceania > Australia (0.04)
- North America > United States > Utah (0.04)
Aligning LLMs for the Classroom with Knowledge-Based Retrieval -- A Comparative RAG Study
Jain, Amay, Cui, Liu, Chen, Si
Large language models like ChatGPT are increasingly used in classrooms, but they often provide outdated or fabricated information that can mislead students. Retrieval Augmented Generation (RAG) improves reliability of LLMs by grounding responses in external resources. We investigate two accessible RAG paradigms, vector-based retrieval and graph-based retrieval to identify best practices for classroom question answering (QA). Existing comparative studies fail to account for pedagogical factors such as educational disciplines, question types, and practical deployment costs. Using a novel dataset, EduScopeQA, of 3,176 questions across academic subjects, we measure performance on various educational query types, from specific facts to broad thematic discussions. We also evaluate system alignment with a dataset of systematically altered textbooks that contradict the LLM's latent knowledge. We find that OpenAI Vector Search RAG (representing vector-based RAG) performs well as a low-cost generalist, especially for quick fact retrieval. On the other hand, GraphRAG Global excels at providing pedagogically rich answers to thematic queries, and GraphRAG Local achieves the highest accuracy with the dense, altered textbooks when corpus integrity is critical. Accounting for the 10-20x higher resource usage of GraphRAG (representing graph-based RAG), we show that a dynamic branching framework that routes queries to the optimal retrieval method boosts fidelity and efficiency. These insights provide actionable guidelines for educators and system designers to integrate RAG-augmented LLMs into learning environments effectively.
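The dynamic branching idea mentioned in this abstract could look roughly like the sketch below. The abstract does not describe the routing rules, so the keyword cues, the `corpus_contradicts_model` flag, and all names here are hypothetical. The mapping of query types to retrievers follows the findings stated in the abstract.

```python
import re
from enum import Enum


class Retriever(Enum):
    VECTOR = "vector_search"          # low-cost generalist, quick fact retrieval
    GRAPH_GLOBAL = "graphrag_global"  # broad thematic queries
    GRAPH_LOCAL = "graphrag_local"    # corpus integrity is critical


# Hypothetical cue words; a real router might use a classifier instead.
THEMATIC_CUES = {"theme", "themes", "overall", "compare", "discuss", "why"}


def route(query: str, corpus_contradicts_model: bool = False) -> Retriever:
    """Pick a retrieval method for a classroom question."""
    if corpus_contradicts_model:
        # The source text must override the model's latent knowledge.
        return Retriever.GRAPH_LOCAL
    words = set(re.findall(r"[a-z]+", query.lower()))
    if words & THEMATIC_CUES:
        return Retriever.GRAPH_GLOBAL
    return Retriever.VECTOR
```

Routing most queries to the cheaper vector path and reserving the graph-based paths for the cases where they help is what would offset the higher resource usage the abstract reports for GraphRAG.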
- North America > United States > Pennsylvania > Chester County > West Chester (0.04)
- Asia > Philippines (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Research Report (1.00)
- Overview > Fact Book (0.34)
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Nguyen, Tin, Bolton, Logan, Taesiri, Mohammad Reza, Nguyen, Anh Totti
An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response that mixes factual and non-factual statements is hard for humans to verify and to base decisions on accurately. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, LLMs would first re-format the question to add XML tags highlighting key facts, and then generate a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain of thought prompting (CoT) on a wide range of 17 tasks from arithmetic and reading comprehension to logical reasoning. When asking humans to verify LLM responses, highlights help time-limited participants to more accurately and efficiently recognize when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoTs tend to make users believe that an answer is correct.
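To make the tagging idea concrete, the sketch below checks whether every highlight in a response refers back to a highlight in the re-formatted question. The abstract only says that XML tags are used; the `<fact1>`-style tag names and the function names are assumptions for illustration, and this is a post-hoc check rather than the prompting technique itself.

```python
import re

# Matches <factN>...</factN> pairs; the tag naming scheme is an assumption.
TAG = re.compile(r"<(fact\d+)>(.*?)</\1>", re.DOTALL)


def extract_highlights(text: str) -> dict[str, str]:
    """Map each tag name to the text span it highlights."""
    return {name: span.strip() for name, span in TAG.findall(text)}


def ungrounded_tags(question: str, answer: str) -> list[str]:
    """Return tags used in the answer that do not appear in the question."""
    question_facts = extract_highlights(question)
    answer_facts = extract_highlights(answer)
    return sorted(tag for tag in answer_facts if tag not in question_facts)
```

An empty result only means the tags line up; as the abstract notes, highlights can still make a wrong answer look convincing.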
- Europe > Ukraine (0.27)
- Asia (0.27)
- North America > Mexico > Veracruz (0.14)
- Research Report (1.00)
- Overview > Fact Book (0.34)
- Leisure & Entertainment > Sports > Football (1.00)
- Health & Medicine > Therapeutic Area (1.00)
FactFlow: Automatic Fact Sheet Generation and Customization from Tabular Dataset via AI Chain Design & Implementation
Vu, Minh Duc, Chen, Jieshan, Xing, Zhenchang, Lu, Qinghua, Xu, Xiwei, Fu, Qian
With the proliferation of data across various domains, there is a critical demand for tools that enable non-experts to derive meaningful insights without deep data analysis skills. To address this need, existing automatic fact sheet generation tools offer heuristic-based solutions to extract facts and generate stories. However, they inadequately grasp the semantics of data and struggle to generate narratives that fully capture the semantics of the dataset or align the fact sheet with specific user needs. Addressing these shortcomings, this paper introduces FactFlow, a novel tool designed for the automatic generation and customisation of fact sheets. FactFlow applies the concept of collaborative AI workers to transform a raw tabular dataset into comprehensive, visually compelling fact sheets. We define an effective taxonomy to profile AI workers for specialised tasks. Furthermore, FactFlow empowers users to refine these fact sheets through intuitive natural language commands, ensuring the final outputs align closely with individual preferences and requirements. Our user evaluation with 18 participants confirms that FactFlow not only surpasses state-of-the-art baselines in automated fact sheet production but also provides a positive user experience during customization tasks.
- Media > Film (0.46)
- Information Technology > Security & Privacy (0.46)
STRUX: An LLM for Decision-Making with Structured Explanations
Lu, Yiming, Hu, Yebowen, Foroosh, Hassan, Jin, Wei, Liu, Fei
Countless decisions shape our daily lives, and it is paramount to understand the how and why behind these choices. In this paper, we introduce a new LLM decision-making framework called STRUX, which enhances LLM decision-making by providing structured explanations. These include favorable and adverse facts related to the decision, along with their respective strengths. STRUX begins by distilling lengthy information into a concise table of key facts. It then employs a series of self-reflection steps to determine which of these facts are pivotal, categorizing them as either favorable or adverse in relation to a specific decision. Lastly, we fine-tune an LLM to identify and prioritize these key facts to optimize decision-making. STRUX has been evaluated on the challenging task of forecasting stock investment decisions based on earnings call transcripts and demonstrated superior performance against strong baselines. It enhances decision transparency by allowing users to understand the impact of different factors, representing a meaningful step towards practical decision-making with LLMs.
- North America > United States (0.28)
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- Financial News (1.00)
- Overview > Fact Book (0.55)
DELTA: Pre-train a Discriminative Encoder for Legal Case Retrieval via Structural Word Alignment
Li, Haitao, Ai, Qingyao, Han, Xinyan, Chen, Jia, Dong, Qian, Liu, Yiqun, Chen, Chong, Tian, Qi
Recent research demonstrates the effectiveness of using pre-trained language models for legal case retrieval. Most of the existing works focus on improving the representation ability for the contextualized embedding of the [CLS] token and calculate relevance using textual semantic similarity. However, in the legal domain, textual semantic similarity does not always imply that the cases are relevant enough. Instead, relevance in legal cases primarily depends on the similarity of key facts that impact the final judgment. Without proper treatments, the discriminative ability of learned representations could be limited since legal cases are lengthy and contain numerous non-key facts. To this end, we introduce DELTA, a discriminative model designed for legal case retrieval. The basic idea involves pinpointing key facts in legal cases and pulling the contextualized embedding of the [CLS] token closer to the key facts while pushing away from the non-key facts, which can warm up the case embedding space in an unsupervised manner. To be specific, this study brings the word alignment mechanism to the contextual masked auto-encoder. First, we leverage shallow decoders to create information bottlenecks, aiming to enhance the representation ability. Second, we employ the deep decoder to enable translation between different structures, with the goal of pinpointing key facts to enhance discriminative ability. Comprehensive experiments conducted on publicly available legal benchmarks show that our approach can outperform existing state-of-the-art methods in legal case retrieval. It provides a new perspective on the in-depth understanding and processing of legal case documents.
- Asia > China (0.14)
- North America > United States > District of Columbia > Washington (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Overview > Fact Book (1.00)
- Research Report > New Finding (0.88)
Multi-Query Focused Disaster Summarization via Instruction-Based Prompting
Seeberger, Philipp, Riedhammer, Korbinian
Automatic summarization of mass-emergency events plays a critical role in disaster management. The second edition of CrisisFACTS aims to advance disaster summarization based on multi-stream fact-finding with a focus on web sources such as Twitter, Reddit, Facebook, and Webnews. Here, participants are asked to develop systems that can extract key facts from several disaster-related events, which ultimately serve as a summary. This paper describes our method to tackle this challenging task. We follow previous work and propose to use a combination of retrieval, reranking, and an embarrassingly simple instruction-following summarization. The two-stage retrieval pipeline relies on BM25 and MonoT5, while the summarizer module is based on the open-source Large Language Model (LLM) LLaMA-13b. For summarization, we explore a Question Answering (QA)-motivated prompting approach and find the evidence useful for extracting query-relevant facts. The automatic metrics and human evaluation show strong results but also highlight the gap between open-source and proprietary systems.
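The first stage of the retrieval pipeline named in this abstract, BM25, can be sketched in a few lines. This is a generic illustration rather than the authors' code: whitespace tokenization, the `k1` and `b` defaults, and the smoothed IDF form are common choices assumed here, and the MonoT5 reranking stage is not shown.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive whitespace tokenizer; real systems use proper analyzers."""
    return text.lower().split()


def bm25_rank(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Return document indices ordered from most to least relevant."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in tokenized) / n or 1.0
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in tokenized for term in set(d))
    scored = []
    for i, doc in enumerate(tokenized):
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, i))
    # Highest score first; ties keep the original document order.
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

In a two-stage setup, only the top few indices returned here would be passed on to the more expensive reranker.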
- Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.14)
- Europe > Germany > Hesse > Darmstadt Region > Wiesbaden (0.05)
- North America > United States > North Carolina > Wake County > Raleigh (0.04)
Single Sequence Prediction over Reasoning Graphs for Multi-hop QA
Ramesh, Gowtham, Sreedhar, Makesh, Hu, Junjie
Recent generative approaches for multi-hop question answering (QA) utilize the fusion-in-decoder method (Izacard and Grave, 2021) to generate a single sequence output which includes both a final answer and a reasoning path taken to arrive at that answer, such as passage titles and key facts from those passages. While such models can lead to better interpretability and high quantitative scores, they often have difficulty accurately identifying the passages corresponding to key entities in the context, resulting in incorrect passage hops and a lack of faithfulness in the reasoning path. To address this, we propose a single-sequence prediction method over a local reasoning graph (SeqGraph; code and models will be released at https://github.com/gowtham1997/SeqGraph) that integrates a graph structure connecting key entities in each context passage to relevant subsequent passages for each question. We use a graph neural network to encode this graph structure and fuse the resulting representations into the entity representations of the model. Our experiments show significant improvements in answer exact-match/F1 scores and faithfulness of grounding in the reasoning path on the HotpotQA dataset and achieve state-of-the-art numbers on the Musique dataset with only up to a 4% increase in model parameters.
- Research Report (0.50)
- Overview > Fact Book (0.34)
- Media > Film (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.35)
APOLLO: An Optimized Training Approach for Long-form Numerical Reasoning
Sun, Jiashuo, Zhang, Hang, Lin, Chen, Gong, Yeyun, Guo, Jian, Duan, Nan
Long-form numerical reasoning in financial analysis aims to generate a reasoning program to calculate the correct answer for a given question. Previous work followed a retriever-generator framework, where the retriever selects key facts from a long-form document, and the generator generates a reasoning program based on retrieved facts. However, they treated all facts equally without considering the different contributions of facts with and without numbers. Meanwhile, program consistency was ignored under supervised training, resulting in lower training accuracy and diversity. To solve these problems, we propose APOLLO to improve the long-form numerical reasoning framework. For the retriever, we adopt a number-aware negative sampling strategy to enable the retriever to be more discriminative on key numerical facts. For the generator, we design consistency-based reinforcement learning and a target program augmentation strategy based on the consistency of program execution results. Experimental results on the FinQA and ConvFinQA leaderboards verify the effectiveness of our proposed method, achieving the new state-of-the-art.
- Research Report (0.40)
- Overview > Fact Book (0.34)